Trees





Kerry Back

A decision tree

Prediction in each cell is the plurality class (for classification) or the cell mean (for regression).
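The two prediction rules can be sketched for a single cell. A minimal stdlib example with made-up observations falling in one hypothetical cell:

```python
from collections import Counter
from statistics import mean

# Hypothetical observations landing in one cell of a fitted tree
cell_classes = ["High", "Low", "High", "Med", "High"]
cell_returns = [0.02, -0.01, 0.05, 0.00, 0.04]

# Classification: predict the plurality (most common) class in the cell
plurality = Counter(cell_classes).most_common(1)[0][0]

# Regression: predict the mean of the target in the cell
cell_mean = mean(cell_returns)

print(plurality)            # High
print(round(cell_mean, 3))  # 0.02
```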

Another example

Splitting criterion for classification

  • In each cell, the prediction is the class with the most observations.
  • Every observation belonging to another class is an error.
  • So the splits try to create “pure” cells.
  • Perfect purity means each cell contains only one class
    \(\Rightarrow\) no errors.
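A standard way to quantify purity is Gini impurity, which is scikit-learn's default splitting criterion for classification trees. A minimal sketch of the idea:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a perfectly pure cell, larger when classes mix."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

pure = ["High"] * 10
mixed = ["High"] * 5 + ["Low"] * 5

print(gini(pure))   # 0.0 -> one class only, no errors possible
print(gini(mixed))  # 0.5 -> maximally impure for two classes
```

The tree-growing algorithm evaluates candidate splits by the impurity of the cells they would create and picks the split that reduces impurity the most.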

Splitting criterion for regression

  • In each cell, the prediction is the mean of the target.
  • Usually the splits try to minimize the sum of squared errors.
  • The algorithm will therefore try to find splits that separate outliers into their own cells.
  • To avoid this dependence on outliers,
    • minimize the sum of absolute errors instead, or
    • choose a target variable that does not have outliers (e.g., percentile ranks).
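The outlier sensitivity can be seen by exhaustively searching for the best single split under each criterion. A toy sketch (data made up; the last observation is a mild outlier, the real break in the data is between x=2 and x=3):

```python
def sse(ys):
    """Sum of squared errors around the cell mean (the regression prediction)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def sae(ys):
    """Sum of absolute errors around the cell median (which minimizes it)."""
    med = sorted(ys)[len(ys) // 2]
    return sum(abs(y - med) for y in ys)

def best_split(xs, ys, loss):
    """Return the threshold minimizing left-loss + right-loss over all splits."""
    pairs = sorted(zip(xs, ys))
    best_cost, best_thr = None, None
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = loss(left) + loss(right)
        if best_cost is None or cost < best_cost:
            best_cost, best_thr = cost, (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_thr

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 1, 1, 2.5]  # last y is a mild outlier

print(best_split(xs, ys, sse))  # 5.5 -> squared error isolates the outlier
print(best_split(xs, ys, sae))  # 2.5 -> absolute error splits at the real break
```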

Data

  • Monthly data in SQL database
  • 100+ predictors described in ghz-predictors.xlsx

SQL

  • select [columns or operations on columns] from [table]
  • join [another table] on [variables to match on]
  • where [select rows based on conditions]
  • order by [columns to sort on]
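The four clauses can be tried end-to-end using Python's built-in sqlite3 as a toy stand-in for the course's MSSQL server (tickers, sectors, and returns below are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table returns (ticker text, date text, ret real);
create table sectors (ticker text, sector text);
insert into returns values ('AAPL', '2021-01', 0.02), ('XOM', '2021-01', 0.05),
                           ('MSFT', '2021-01', -0.01);
insert into sectors values ('AAPL', 'Tech'), ('MSFT', 'Tech'), ('XOM', 'Energy');
""")

rows = conn.execute("""
    select r.ticker, r.ret, s.sector       -- columns
    from returns r
    join sectors s on r.ticker = s.ticker  -- variables to match on
    where r.ret > 0                        -- select rows based on conditions
    order by r.ret desc                    -- columns to sort on
""").fetchall()
print(rows)  # [('XOM', 0.05, 'Energy'), ('AAPL', 0.02, 'Tech')]
```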

Connect with Python

from sqlalchemy import create_engine
import pymssql
import pandas as pd

server = "mssql-82792-0.cloudclusters.net:16272"
username = "user"
password = "" # paste password between quote marks
database = "ghz"
string = "mssql+pymssql://" + username + ":" + password + "@" + server + "/" + database
conn = create_engine(string).connect()

Example: ROEQ and mom12m in 2021-01

data = pd.read_sql(
    """
    select ticker, date, ret, roeq, mom12m
    from data
    where date='2021-01'
    """, 
    conn
)
data = data.dropna()

data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2542 entries, 0 to 2547
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ticker  2542 non-null   object 
 1   date    2542 non-null   object 
 2   ret     2542 non-null   float64
 3   roeq    2542 non-null   float64
 4   mom12m  2542 non-null   float64
dtypes: float64(3), object(2)
memory usage: 119.2+ KB

Fit a classification tree

from sklearn.tree import DecisionTreeClassifier

# sort returns into terciles: 0 = bottom third, 1 = middle, 2 = top
data['class'] = data.ret.transform(
  lambda x: pd.qcut(x, 3, labels=(0, 1, 2))
)
X = data[["roeq", "mom12m"]]
y = data["class"]

model = DecisionTreeClassifier(
  max_depth=2, 
  random_state=0
)
model.fit(X, y)

View the classification tree

from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

plot_tree(model)
plt.show()

Confusion matrix

from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(model, X=X, y=y)
plt.show()
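Under the hood, a confusion matrix just counts (actual, predicted) pairs. A stdlib sketch with made-up labels for the three classes:

```python
from collections import Counter

# Hypothetical actual and predicted classes for seven observations
actual    = [0, 0, 1, 1, 2, 2, 2]
predicted = [0, 1, 1, 1, 2, 0, 2]

# Entry [a][p] counts observations of actual class a predicted as class p
counts = Counter(zip(actual, predicted))
matrix = [[counts[(a, p)] for p in range(3)] for a in range(3)]
print(matrix)  # [[1, 1, 0], [0, 2, 0], [1, 0, 2]]
```

The diagonal holds the correct predictions; everything off-diagonal is an error.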

Predicted class probabilities

  • Three of the four leaves have a plurality of High, so all observations in those leaves get a prediction of High.
  • But the three leaves are not the same.
  • The fraction of Highs in a leaf estimates the probability that an observation in the leaf is High. For the four leaves, the fractions are
    • 53/69 = 77%
    • 315/695 = 45%
    • 409/1664 = 25%
    • 70/114 = 61%
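For a fitted tree, scikit-learn's `model.predict_proba(X)` reports exactly these leaf fractions. They can also be checked directly from the (High count, total count) pairs read off the tree plot:

```python
# (High count, total count) in each leaf, read from the tree plot above
leaves = [(53, 69), (315, 695), (409, 1664), (70, 114)]
probs = [round(highs / total, 2) for highs, total in leaves]
print(probs)  # [0.77, 0.45, 0.25, 0.61]
```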

Fit a regression tree

from sklearn.tree import DecisionTreeRegressor

X = data[["roeq", "mom12m"]]
y = data["ret"]

model = DecisionTreeRegressor(
  max_depth=2,
  random_state=0
)
model.fit(X, y)

View the regression tree

plot_tree(model)
plt.show()

Which are the low ROE, high MOM stocks?

subset = data[
  (data.roeq<=-0.181) & (data.mom12m>1.672)
]
subset
ticker date ret roeq mom12m class
28 KSPN 2021-01 1.635680 -1.526803 3.986288 2
136 PACB 2021-01 0.247109 -0.382200 2.075876 2
162 IBIO 2021-01 0.523810 -0.892083 3.986288 2
241 CLIR 2021-01 0.361775 -0.182654 2.357377 2
248 CRDF 2021-01 -0.344080 -0.671333 3.986288 0
249 IDEX 2021-01 0.944724 -0.793602 2.237116 2
397 IPWR 2021-01 1.095471 -0.270806 2.908696 2
425 TRVN 2021-01 0.018692 -0.216079 1.984897 1
458 LPCN 2021-01 0.301471 -0.972074 3.156924 2
528 MARA 2021-01 0.986590 -0.681703 3.986288 2
564 W 2021-01 0.205970 -0.233803 1.814651 2
599 TCON 2021-01 -0.218803 -1.526803 3.290598 0
602 NVTA 2021-01 0.184406 -0.510230 2.078116 2
632 ONCS 2021-01 0.195349 -0.413288 1.872928 2
1209 NEON 2021-01 0.158519 -0.304381 2.943299 2
1724 DPW 2021-01 0.066667 -0.555331 3.986288 1
1871 INO 2021-01 0.440678 -0.698240 2.703030 2
2023 AWH 2021-01 0.329359 -0.722558 3.986288 2
2086 GME 2021-01 16.250530 -0.255172 1.723684 2
2388 OPTT 2021-01 0.490706 -0.322596 2.310345 2
2413 RIOT 2021-01 0.207769 -0.318611 3.986288 2
2469 CAPR 2021-01 0.918367 -0.363181 2.281250 2
2540 GNMK 2021-01 -0.054110 -0.360225 1.779626 0

Predicting ranks

data['rnk'] = data.ret.rank(pct=True)

X = data[["roeq", "mom12m"]]
y = data["rnk"]

model = DecisionTreeRegressor(
  max_depth=2,
  random_state=0
)
model.fit(X, y)

View the regression tree for ranks

plot_tree(model)
plt.show()
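Percentile ranks lie in (0, 1], so the outlier problem disappears: GME's 1625% return and a merely large return map to nearby ranks. A stdlib sketch of what `rank(pct=True)` computes when values are distinct (made-up returns):

```python
rets = [0.05, -0.02, 0.10, 0.01]
n = len(rets)

# percentile rank: fraction of observations less than or equal to each value
pct = [sum(x <= r for x in rets) / n for r in rets]
print(pct)  # [0.75, 0.25, 1.0, 0.5]
```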

Predicting numerical classes

X = data[["roeq", "mom12m"]]
y = data["class"]

model = DecisionTreeRegressor(
  max_depth=2,
  random_state=0
)
model.fit(X, y)

View the regression tree for classes

plot_tree(model)
plt.show()